Cell-Probe Lower Bounds for Prefix Sums 

Emanuele Viola* 
June 7, 2009 



Abstract 

We prove that to store n bits $x \in \{0,1\}^n$ so that each prefix-sum (a.k.a. rank) query
$\mathrm{Sum}(i) := \sum_{k \le i} x_k$ can be answered by non-adaptively probing q cells of $\lg n$ bits, one
needs memory

$$n + n/\lg^{O(q)} n.$$

This matches a recent upper bound of $n + n/\lg^{\Omega(q)} n$ by Pătraşcu (FOCS 2008), also
non-adaptive.

We also obtain an $n + n/\lg^{2^{O(q)}} n$ lower bound for storing a string of balanced
brackets so that each Match(i) query can be answered by non-adaptively probing q
cells. To obtain these bounds we show that a too efficient data structure allows us to
break the correlations between query answers.

1 Introduction



The problem of succinctly storing n bits $x \in \{0,1\}^n$ so that each prefix-sum (a.k.a. rank)
query $\mathrm{Sum}(i) := \sum_{k \le i} x_k$ can be answered efficiently is a fundamental data structure problem
that has been studied for more than two decades. The best known upper bound for this
problem is a data structure by Pătraşcu which answers queries by probing q cells of $\lg n$ bits,
and uses memory

$$n + n/\lg^{\Omega(q)} n, \tag{1}$$

see [Pat08] and the references there.

We prove the first lower bound for this problem, matching Pătraşcu's upper bound (1) for
non-adaptive schemes. We remark that known upper bounds, including [Pat08] in Equation
(1), are also non-adaptive.

Theorem 1.1 (Lower bound for prefix sums). To store $\{0,1\}^n$ in $[n]^u$ so that each $\mathrm{Sum}(i) := \sum_{k \le i} x_k$ query can be computed by non-adaptively probing q cells of $\lg_2 n$ bits, one needs
memory

$$u \cdot \lg_2 n \ge n - 1 + n/\lg^{A \cdot q} n,$$

where A is an absolute constant.

Our techniques apply to other problems as well; for example we obtain the following lower 
bound for the problem of storing strings of balanced brackets so that indexes to matching 
brackets can be retrieved quickly. 

Theorem 1.2 (Lower bound for balanced brackets). To store $\mathrm{Bal} := \{x \in \{0,1\}^n : x$ corresponds to a string of balanced brackets$\}$ in $[n]^u$, n even, so that each Match(i) query
can be computed by non-adaptively probing q cells of $\lg_2 n$ bits, one needs memory

$$u \cdot \lg_2 n \ge n - 1 + n/\lg^{A^q} n,$$

where A is an absolute constant.

The best known upper bound for this problem is again $n + n/\lg^{\Omega(q)} n$, non-adaptive
[Pat08]. It is an interesting open problem to close the gap between that and our lower
bound of $n + n/\lg^{2^{O(q)}} n$.

1.1 Techniques 

We now explain the techniques we use to prove our lower bound for prefix sums. We show that 
a too efficient data structure would allow us to break the dependencies between the various 
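prefix sums and obtain the following contradiction: For three indices $1 \le p < i < j \le n$, a
subset $X \subseteq \{0,1\}^n$ of inputs, and some integers s, s':

$$1/1000 \ge \Pr_{x \in X}\Big[\sum_{k \le j} x_k \ge s \wedge \sum_{k \le i} x_k \le s'\Big] \tag{2}$$
$$\approx \Pr_{x \in X}\Big[\sum_{k \le j} x_k \ge s\Big] \cdot \Pr_{x \in X}\Big[\sum_{k \le i} x_k \le s'\Big] \tag{3}$$
$$\ge \frac{1}{10} \cdot \frac{1}{10} > 1/1000. \tag{4}$$

We now explain how we obtain these inequalities.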



Approximation (3): For the Approximation (3) we ignore the integers s, s' and more
generally show that the distributions of the sums $\sum_{k \le i} x_k$ and $\sum_{k \le j} x_k$ are statistically close
to independent. By the data structure, for every input $x \in \{0,1\}^n$ and index i the sum
$\sum_{k \le i} x_k$ can be retrieved as a function $d_i$ of q cells $Q(i) \subseteq [u]$ of the encoding Enc(x) of x:

$$\sum_{k \le i} x_k = d_i\big(\mathrm{Enc}(x)|_{Q(i)}\big),$$

where $\mathrm{Enc}(x)|_{Q(i)}$ denotes the q cells of size n ($\lg n$ bits) of $\mathrm{Enc}(x) \in [n]^u$ indexed by Q(i).

An evident case in which the two sums $\sum_{k \le i} x_k$ and $\sum_{k \le j} x_k$ could correlate (for a random
input $x \in \{0,1\}^n$) is when $Q(i) \cap Q(j) \ne \emptyset$. To avoid this, we prove a separator Lemma 2.2
yielding a set B of size $n/\lg^b n$ such that there are $n/\lg^a n \gg |B|$ disjoint sets among

$$Q(1) \setminus B, Q(2) \setminus B, \ldots, Q(n) \setminus B.$$

We denote by $V \subseteq [n]$ the set of indices of the disjoint sets.

The proof of the separator lemma is an inductive argument based on coverings; it resembles
arguments appearing in other contexts (cf. [Vio07, Overview of the Proof of Lemma 11]).

Via an averaging argument we fix the values of the cells whose index is in B so as to have
a set of inputs $X \subseteq \{0,1\}^n$ of size $|X| \ge 2^n/n^{|B|}$ such that the $n/\lg^a n$ sums $\sum_{k \le i} x_k$ for
$i \in V$ can be recovered by reading disjoint cells, which we again denote Q(i) (so now we
have $Q(i) \cap Q(j) = \emptyset$ for distinct $i, j \in V$).

This takes care of the "evident" problem that different prefix sums may be answered 
by reading overlapping sets of cells. Of course, there could be other types of correlations 
arising from the particular distribution that a random input x E X induces in the values of 
the cells. To handle these, we rely on a by-now standard information-theoretic lemma that 
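guarantees the existence of a set of cell indices $G \subseteq [u]$ such that any 2q cells whose indices
are in G are jointly uniformly distributed (cf. [Vio09, §2] and the references therein). Thus, if
we find $i, j \in V$ such that $Q(i) \cup Q(j) \subseteq G$ we conclude the derivation of Approximation
(3) as follows

$$\Big(\sum_{k \le i} x_k, \sum_{k \le j} x_k\Big)_{x \in X} = \big(d_i(\mathrm{Enc}(x)|_{Q(i)}), d_j(\mathrm{Enc}(x)|_{Q(j)})\big)_{x \in X} \approx \big(d_i(U|_{Q(i)}), d_j(U|_{Q(j)})\big)$$
$$= d_i(U|_{Q(i)}) \times d_j(U|_{Q(j)}) \approx \Big(\sum_{k \le i} x_k\Big)_{x \in X} \times \Big(\sum_{k \le j} x_k\Big)_{x \in X},$$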

where U denotes the uniform distribution over the cell values. The first equality is by
definition, the approximations are by the information-theoretic lemma mentioned above,
and the second equality holds because $Q(i) \cap Q(j) = \emptyset$.

To conclude this part, it only remains to see that there are indeed $i, j \in V$ such that
$Q(i) \cup Q(j) \subseteq G$. The size of G is related to the redundancy $r = n/\lg^{A \cdot q} n$ of the data
structure, specifically $|[u] \setminus G| = O(q \cdot r)$. As the sets Q(i) for $i \in V$ are disjoint, we have
a set $V' \subseteq V$ of indices of size at least $|V'| \ge |V| - |[u] \setminus G| \ge n/\lg^a n - q \cdot r \ge \Omega(n/\lg^a n)$ such
that Approximation (3) is satisfied by any $i, j \in V'$.



Inequalities (2) and (4). To prove Inequalities (2) and (4) we reason in two steps. First,
we find three indices p < i < j, where i and j are both in V', such that the entropy of the
j − p variables $x_{p+1} \ldots x_j$, for $x \in X$, is large even when conditioned on all the others before
them, and moreover $i - p \ge c(j - i)$ for a large constant c. Specifically, we have the picture

$$x_1 x_2 \cdots x_p \;\Big|\; \underbrace{x_{p+1} x_{p+2} \cdots x_i}_{\ge c \cdot d = c(j - i)} \;\Big|\; \underbrace{x_{i+1} \cdots x_j}_{d := j - i}$$

and the guarantee that

$$H(x_{p+1}, x_{p+2}, \ldots, x_j \mid x_1, x_2, \ldots, x_p) \ge (j - p) - \epsilon. \tag{5}$$

Second, from Equation (5) we obtain the integers s, s' to satisfy Inequalities (2) and (4).

For the first step we start by selecting a subset of the indices in V' that partitions [n] in
intervals such that the first is $\ge c$ times larger than the second, the third is $\ge c$ times larger
than the fourth and so on: writing $v'_1 < v'_2 < \cdots$ for the selected indices and $v'_0 := 0$,

$$v'_{2k+1} - v'_{2k} \ge c\,(v'_{2k+2} - v'_{2k+1}) \quad \text{for every } k.$$

A simple argument shows that we can find a subset of indices as above whose size is an
$\Omega(1/\lg n)$ fraction of $|V'|$.

We then view $x \in X$ as the concatenation of random variables $Z_0, Z_1, \ldots$, each spanning
two adjacent intervals: $Z_k$ consists of the bits of x in positions $v'_{2k} + 1, \ldots, v'_{2k+2}$.

We would like to find a $Z_k$ that has high entropy even when conditioned on the previous
ones. We make the simple and key observation that known proofs of the information-theoretic
lemma (e.g., the one in [SV08]) give this. Specifically, we want to avoid the variables that
have a distribution far from uniform as a result of the entropy we lost when we went from
the set of all inputs $\{0,1\}^n$ to $X \subseteq \{0,1\}^n$. In the fixing we lost $\lg n \cdot |B| = n/\lg^{b-1} n$ bits
of entropy (since each cell contains $\lg n$ bits), and by an information-theoretic lemma there
are only $O(n/\lg^{b-1} n)$ bad variables $Z_k$. Since the number of variables $Z_k$ is $\Omega(|V'|/\lg n) \ge n/\lg^{a+2} n$, by taking $b > a + 4$ we see that most variables $Z_k$ satisfy Equation (5). Note here
we rely on the separator lemma giving us a set B whose removal yields a number of disjoint
sets $|V| = n/\lg^a n \gg n/\lg^b n = |B|$.

For the second step, we reason as follows. For simplicity, let us think of a typical value
t of $x_1 + x_2 + \cdots + x_p$ for $x \in X$ (we actually pick a t which "cuts in half" the outputs of
$x_1 + x_2 + \cdots + x_p$). By Equation (5) we can think of the distribution of $x_{p+1}, x_{p+2}, \ldots, x_j$
as j − p independent coin tosses, and this is true even after we condition on the first p bits
summing to t. We set the integers $s := t + (j - p)/2 + c^{1/4}\sqrt{d}$, and $s' := t + (i - p)/2$, where
recall $d := j - i$.

To see Inequality (2), note that for it to be true it must be the case that the sum of
j − i = d coin tosses exceeds its mean by $c^{1/4}\sqrt{d}$, which has probability $\le 1/1000$ for large
enough c.

For the Inequalities (4), note that $\Pr_{x \in X}[\sum_{k \le i} x_k \le s']$ is about 1/2, as
it is just the probability that a sum of coin tosses does not exceed its mean. Finally,
$\Pr_{x \in X}[\sum_{k \le j} x_k \ge s]$ is the probability that the sum of $j - p \ge c \cdot d$ coin tosses exceeds its
mean by $c^{1/4}\sqrt{d} \le c^{1/4}\sqrt{(j - p)/c} = \sqrt{j - p}/c^{1/4}$; this probability is $\ge 1/10$ for a sufficiently
large c.
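
The two coin-toss estimates in this second step can be sanity-checked numerically. The following Python sketch computes exact binomial tails; the values c = 16, d = 100 and ℓ = c·d are illustrative assumptions only (the proof merely needs c to be a sufficiently large constant), and the helper name `tail_at_least` is hypothetical.

```python
from fractions import Fraction
from math import ceil, comb, sqrt

def tail_at_least(m, threshold):
    """Exact Pr[S >= threshold] for S a sum of m fair 0-1 coin tosses."""
    lo = max(0, ceil(threshold))
    return Fraction(sum(comb(m, k) for k in range(lo, m + 1)), 2 ** m)

c, d = 16, 100              # illustrative values, not the paper's constants
ell = c * d                 # the long block: ell >= c * d
dev = c ** 0.25 * sqrt(d)   # the deviation c^{1/4} * sqrt(d)

# Inequality (2): d tosses must exceed their mean by c^{1/4} * sqrt(d).
p_short_block = tail_at_least(d, d / 2 + dev)
# Inequality (4): ell + d tosses exceed their mean by the same deviation.
p_long_block = tail_at_least(ell + d, (ell + d) / 2 + dev)

print(float(p_short_block), float(p_long_block))
```

With these sample values the first tail is far below 1/1000 while the second is above 1/10, which is the gap the overview exploits.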

This completes the overview of our proof. 



Comparison with [Vio09]. In this section we compare our techniques with those in
[Vio09]. To illustrate the latter, consider the problem of storing n ternary elements $t_1, \ldots, t_n \in \{0, 1, 2\}$ in u bits (not cells) so that each ternary element can be retrieved by reading just
q bits. The main idea in [Vio09] is to use the fact that if the data structure is too succinct,
then, for a random input, the query answers are a function of q almost uniform bits.
But the probability that a uniform ternary element in {0, 1, 2} is 2 equals 1/3, whereas the
probability that a function of q uniform bits is 2 equals $A/2^q$ for some integer A. Since the
gap between 1/3 and $A/2^q$ is $\Omega(2^{-q})$, if the q bits used to retrieve the ternary element are
$o(2^{-q})$-close to uniform we reach a contradiction.
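
The granularity gap used in this argument is easy to verify by brute force. The sketch below (the function name `min_gap_to_one_third` is hypothetical) computes the smallest distance between 1/3 and any fraction of the form $A/2^q$; since $2^q$ is never divisible by 3, that distance is exactly $1/(3 \cdot 2^q)$.

```python
from fractions import Fraction

def min_gap_to_one_third(q):
    """Smallest |1/3 - A/2^q| over all integers A in [0, 2^q]."""
    third = Fraction(1, 3)
    return min(abs(third - Fraction(a, 2 ** q)) for a in range(2 ** q + 1))

for q in range(1, 9):
    print(q, min_gap_to_one_third(q))
```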

This technique cannot be used when reading $q \ge \lg n$ bits, because one would need the
bits to be $2^{-q} \le 1/n$ close to uniform, which cannot be guaranteed (cf. the parameters
of Lemma 2.4; the dependence on the error parameter is tight up to constant factors, as
can be verified by conditioning on the event that the majority of the bits is 1). The same
problem arises with non-boolean queries like prefix sums; probing two cells gives $2 \lg n$ bits
and granularity $1/n^2$ in the probabilities of the query output, which can be designed to be
at statistical distance $\le 1/n$ from the correct distribution.

This work departs from the idea of focusing on the distribution of a single query output,
and rather focuses on the correlation between different query outputs.

Organization: In §2 we formally state and prove our lower bound for prefix sums relying
on three lemmas which are proven in §3. In §4 we prove our lower bound for matching
brackets. We conclude in §5 with some open problems.

2 Lower bound for prefix sums 

We now formally state the data structure problem and then recall our main result. 

Definition 2.1 (Data structure for prefix sums). We say that we store $\{0,1\}^n$ in $[n]^u$ supporting
prefix-sum queries by probing q cells if there is a map $\mathrm{Enc} : \{0,1\}^n \to [n]^u$, n sets
$Q(1), \ldots, Q(n) \subseteq [u]$ of size q each and n decoding functions $d_1, \ldots, d_n$ mapping $[n]^q$ to [n]
such that for every $x \in \{0,1\}^n$ and every $i \in [n]$:

$$\mathrm{Sum}(i) := \sum_{k \le i} x_k = d_i\big(\mathrm{Enc}(x)|_{Q(i)}\big),$$

where $\mathrm{Enc}(x)|_{Q(i)}$ denotes the q cells of size n of $\mathrm{Enc}(x) \in [n]^u$ indexed by Q(i).
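
To make the objects in Definition 2.1 concrete, here is a deliberately wasteful toy instance in Python, given only as an illustration and not as a construction from the paper: cell i stores Sum(i) directly, so q = 1 and u = n. The names `enc`, `probes`, `decode` and `query_sum` are hypothetical, and for simplicity a cell may hold any value in {0, ..., n}.

```python
from itertools import product

def enc(x):
    """Toy Enc: cell i (0-indexed as i - 1) stores x_1 + ... + x_i."""
    cells, total = [], 0
    for bit in x:
        total += bit
        cells.append(total)
    return cells

def probes(i):
    """Q(i): the fixed set of cells read by query i; it does not depend on x."""
    return (i - 1,)

def decode(i, probed_cells):
    """d_i: the single probed cell already holds Sum(i)."""
    return probed_cells[0]

def query_sum(i, cells):
    """Answer Sum(i) by reading only the cells in Q(i)."""
    return decode(i, [cells[k] for k in probes(i)])

x = (1, 0, 1, 1, 0)
print([query_sum(i, enc(x)) for i in range(1, len(x) + 1)])
```

This scheme is non-adaptive because `probes` is fixed in advance, but it spends about $n \lg n$ bits, far more than the $n + n/\lg^{O(q)} n$ regime the theorem addresses.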

Theorem 1.1 (Lower bound for prefix sums). (Restated.) To store $\{0,1\}^n$ in $[n]^u$ so that
each $\mathrm{Sum}(i) := \sum_{k \le i} x_k$ query can be computed by non-adaptively probing q cells of $\lg_2 n$
bits, one needs memory

$$u \cdot \lg_2 n \ge n - 1 + n/\lg^{A \cdot q} n,$$

where A is an absolute constant.

The proof relies on a few lemmas which we describe next. 

The first is the separator lemma which shows that given any family of small subsets of 
some universe, we can remove a few w/g elements from the universe to find many w disjoint 
sets in our family. (The sets are disjoint if no element is contained in any two of them; the 
empty set is disjoint from anything else.) 

Lemma 2.2 (Separator). For every n sets Q(1), Q(2), ..., Q(n) of size q each and every
desired "gap" g, there is $w \in [n/(q \cdot g)^q, n]$ and a set B of size $|B| \le w/g$ such that there are
$\ge w$ disjoint sets among

$$Q(1) \setminus B, Q(2) \setminus B, \ldots, Q(n) \setminus B.$$

The next lemma shows that given a set V of w indices in [n] we can find a large subset
of indices $V' \subseteq V$ that partition [n] in intervals such that any interval starting at an
even-indexed $v'_k$ is $\ge c$ times as large as the next one.

Lemma 2.3 (Stretcher). Let $1 \le v_1 < v_2 < \cdots < v_w \le n$ be w indices in [n]. Let c > 1
and n sufficiently large. Then there are $w' = 2\lfloor w/(c \cdot \lg n) \rfloor$ indices $V' := \{v'_1, v'_2, \ldots, v'_{w'}\} \subseteq \{v_1, v_2, \ldots, v_w\}$, ordered as $v'_1 < v'_2 < \cdots < v'_{w'}$, such that

$$v'_{2k+1} - v'_{2k} \ge c\,(v'_{2k+2} - v'_{2k+1})$$

for every $k = 0, 1, \ldots, w'/2 - 1$, where $v'_0 := 0$.

For the next lemmas recall the concept of entropy H of a random variable X, defined
as $H(X) := \sum_x \Pr[X = x] \cdot \lg(1/\Pr[X = x])$ and conditional entropy $H(X|Y) := E_{y \sim Y} H(X|Y = y)$ (cf. [CT06, Chapter 2]).

The following is the information-theoretic lemma showing that if one conditions uniformly
distributed random variables $X_1, \ldots, X_n$ on an event that happens with noticeable
probability, then even following the conditioning most groups of q variables are close to being
uniformly distributed. See [Vio09] and the references therein.

Lemma 2.4 (Information-theoretic). Let $X = (X_1, \ldots, X_n)$ be a collection of independent
random variables where each one of them is uniformly distributed over a set S. Let $A \subseteq S^n$
be an event such that $\Pr[X \in A] \ge 2^{-a}$, and denote by $(X'_1, \ldots, X'_n)$ the random variables
conditioned on the event A. Then for any $\eta > 0$ and integer q there exists a set $G \subseteq [n]$
such that $|G| \ge n - 16 \cdot q \cdot a/\eta^2$ and for any q indices $i_1 < i_2 < \cdots < i_q \in G$ we have the
distribution $(X'_{i_1}, X'_{i_2}, \ldots, X'_{i_q})$ is $\eta$-close to uniform.

We need a variant of the above lemma where the variables keep high entropy even when 
conditioning on the ones before. 

Lemma 2.5 (Information-theoretic II). Let Z be uniformly distributed in a set $X \subseteq \{0,1\}^n$
of size $|X| = 2^{n-a}$. Let $Z = (Z_1, \ldots, Z_k)$ where $Z_i \in \{0,1\}^{s_i}$ (so that $\sum_{i \le k} s_i = n$). There
is a set $G \subseteq [k]$ of size $|G| \ge k - a/\epsilon$ such that for any $i \in G$ we have

$$H(Z_i \mid Z_1 Z_2 \cdots Z_{i-1}) \ge s_i - \epsilon.$$

In particular, $Z_i$ is close to uniform over $\{0,1\}^{s_i}$.

Finally, the next lemma lets us turn high entropy of a block of variables conditioned on
the previous ones into bounds on the probabilities (2) and (4) in the overview Section 1.1.

Lemma 2.6 (Entropy-sum). Let $X_1, X_2, \ldots, X_n$ be 0-1 random variables, and p < i < j
three indices in [n] such that for $\ell := (i - p)$ and $d := j - i$ we have $\ell \ge c \cdot d$ for a sufficiently
large c. Suppose that

$$H(X_{p+1}, X_{p+2}, \ldots, X_j \mid X_1, X_2, \ldots, X_p) \ge \ell + d - 1/c.$$

Then there exists an integer t such that

$$\Pr_X\Big[\sum_{k \le j} X_k \ge t + \ell/2 + d/2 + c^{1/4}\sqrt{d}\Big] \ge 1/10, \text{ and}$$
$$\Pr_X\Big[\sum_{k \le i} X_k \le t + \ell/2\Big] \ge 1/10, \text{ but}$$
$$\Pr_X\Big[\sum_{k \le j} X_k \ge t + \ell/2 + d/2 + c^{1/4}\sqrt{d} \wedge \sum_{k \le i} X_k \le t + \ell/2\Big] \le 1/1000 \ (< 1/10 \cdot 1/10).$$



2.1 Proof of lower bound 

Let c be a fixed, sufficiently large constant to be determined later, and let n go to infinity. We
prove the theorem for A := c + 1: we assume there exists a representation with redundancy
$n/\lg^{(c+1)q} n - 1$ and derive a contradiction. First, we assume $q \ge 1$ for else the theorem is
trivially true. We further assume that $q \le (\lg n)/(2(c + 1) \lg\lg n)$ for else the redundancy is
< 0 and again the theorem is trivially true.

Separator: We apply Lemma 2.2 to the sets Q(1), ..., Q(n) with gap $g := \lg^c n$ to obtain
$w \in [n/(q \cdot \lg^c n)^q, n]$ and a set $B \subseteq [u]$ of size $|B| \le w/\lg^c n$ such that there are $\ge w$ disjoint
sets among

$$Q(1) \setminus B, Q(2) \setminus B, \ldots, Q(n) \setminus B.$$

Let these sets be

$$Q(v_1) \setminus B, Q(v_2) \setminus B, \ldots, Q(v_w) \setminus B,$$

and let $V := \{v_1, v_2, \ldots, v_w\} \subseteq [n]$ be the corresponding set of indices. Observe that $w \ge n/(q \cdot \lg^c n)^q \ge n/(\lg^{c+1} n)^{(\lg n)/(2(c+1)\lg\lg n)} = \sqrt{n}$.

Over the choice of a uniform input $x \in \{0,1\}^n$, consider the most likely value z for the
$w/\lg^c n$ cells indexed by B. Let us fix this value for the cells. Since this is the most likely
value, we are still decoding correctly a set X of at least $2^n/n^{|B|}$ inputs. From now on we focus on this
set of inputs. Since these values are fixed, we can modify our decoding as follows. For every
i define $Q'(i) := Q(i) \setminus B$ and also let $d'_i$ be $d_i$ where the values of the probes corresponding
to cells in B have been fixed to the corresponding value in z. By renaming variables, letting
$u' := u - |B|$ and $\mathrm{Enc}' : \{0,1\}^n \to [n]^{u'}$ be Enc restricted to the cells in $[u] \setminus B$, we see that
we are now encoding X in $[n]^{u'}$ in the following sense: for every $x \in X$ and every $i \in [n]$:

$$\sum_{k \le i} x_k = d'_i\big(\mathrm{Enc}'(x)|_{Q'(i)}\big), \tag{6}$$

where note for any distinct $i, j \in V$ we have $Q'(i) \cap Q'(j) = \emptyset$.

Uniform cells: To the choice of a uniform $x \in X \subseteq \{0,1\}^n$ there corresponds a uniform
encoding $y \in Y \subseteq [n]^{u'}$, where

$$|X| = |Y| \ge 2^n/n^{|B|} = 2^{n - w/\lg^{c-1} n}. \tag{7}$$

Let $y = (y_1, \ldots, y_{u'})$ be selected uniformly in $Y \subseteq [n]^{u'}$. By Lemma 2.4 there is a set $G \subseteq [u']$
of size

$$|G| \ge u' - 16 \cdot 2q \cdot \lg\big(n^{u'}/|Y|\big) \cdot c^2 \ge u' - 16 \cdot 2q \cdot \lg\Big(\frac{n^{u'} \cdot n^{|B|}}{2^n}\Big) \cdot c^2 = u' - 32q \cdot r \cdot c^2,$$

where $r := (u \lg n) - n = n/\lg^{(c+1)q} n - 1$ is the redundancy of the data structure, such that for
any 2q indices $k_1, k_2, \ldots, k_{2q} \in G$, the cells $y_{k_1}, \ldots, y_{k_{2q}}$ are jointly (1/c)-close to uniform. Since
the sets $Q'(v_1), Q'(v_2), \ldots, Q'(v_w) \subseteq [u']$ are disjoint, there is a set $V_2 \subseteq V$ such that for any
distinct $i, j \in V_2$ and y uniform in Y the distribution

$$\big(y|_{Q'(i)}, y|_{Q'(j)}\big)_{y \in Y} \text{ is } 1/c \text{ close to uniform over } [n]^{|Q'(i)| + |Q'(j)|}, \tag{8}$$

and the size of $V_2$ is

$$|V_2| \ge w - 32q \cdot r \cdot c^2 \ge w - 32q \cdot \frac{n}{\lg^{(c+1)q} n} \cdot c^2 = w - 32q \cdot \frac{n}{\lg^{cq} n \cdot \lg^{q} n} \cdot c^2 \ge w/2, \tag{9}$$

where the last inequality in (9) holds because $w \ge n/(q \cdot \lg^c n)^q$. Specifically, the inequality is
implied by $((\lg n)/q)^q \ge 64q \cdot c^2$, which is true because $q \le (\lg n)/(2(c + 1)\lg\lg n)$.

Stretcher: Apply Lemma 2.3 to $V_2$ to obtain a subset $V_3 \subseteq V_2$ of even size

$$w' := |V_3| \ge 2\lfloor |V_2|/(c \cdot \lg n) \rfloor \ge 2\lfloor w/(2c \cdot \lg n) \rfloor \ge w/(2c \cdot \lg n), \tag{10}$$

such that if $v'_1 < v'_2 < \cdots < v'_{w'}$ is an ordering of the elements of $V_3$ we have

$$v'_{2k+1} - v'_{2k} \ge c\,(v'_{2k+2} - v'_{2k+1}) \tag{11}$$

for every $k = 0, 1, \ldots, w'/2 - 1$, where $v'_0 := 0$.

Entropy in input bits. For a uniform $x \in X$ consider the w'/2 random variables $Z_k$ where
for $0 \le k < w'/2$, $Z_k$ stands for the $s_k$ bits of x from position $v'_{2k} + 1$ to $v'_{2k+2}$:

$$Z_k := x_{v'_{2k}+1}\, x_{v'_{2k}+2} \cdots x_{v'_{2k+2}} \in \{0,1\}^{s_k},$$

and $Z_{w'/2-1}$ is padded to include the remaining bits as well, $Z_{w'/2-1} = x_{v'_{w'-2}+1}\, x_{v'_{w'-2}+2} \cdots x_n$. Thus x is the concatenation $Z_0 Z_1 Z_2 \cdots$.

Recalling the bound (7) on the size of X, apply Lemma 2.5 to conclude

$$H(Z_k \mid Z_0 Z_1 \cdots Z_{k-1}) \ge s_k - 1/c \tag{12}$$

for

$$w'/2 - c \cdot w/\lg^{c-1} n \ge w/(4c \cdot \lg n) - c \cdot w/\lg^{c-1} n \ge 1 \tag{13}$$

variables $Z_k$. Fix an index k such that Equation (12) holds for $Z_k$, and let

$$p := v'_{2k}, \quad i := v'_{2k+1}, \quad j := v'_{2k+2}$$

be the corresponding indices, where p is either 0 or in $V_3$ and $\{i, j\} \subseteq V_3$. We can rewrite
Equation (12) as $H(x_{p+1} x_{p+2} \cdots x_j \mid x_1 x_2 \cdots x_p) \ge (j - p) - 1/c$. Using in addition Equation (11) we are in
the position to apply Lemma 2.6 ($\ell := (i - p) \ge c \cdot d := c \cdot (j - i)$). Let t be the integer in
the conclusion of Lemma 2.6 and let

$$s := t + (\ell + d)/2 + c^{1/4}\sqrt{d}, \quad s' := t + \ell/2.$$

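Let U' denote the uniform distribution over the u' cells, i.e., over $[n]^{u'}$. We have the following
contradiction:

$$\Pr_{x \in X}\Big[\sum_{k \le j} x_k \ge s \wedge \sum_{k \le i} x_k \le s'\Big]$$
$$= \Pr_{y \in Y}\big[d'_j(y|_{Q'(j)}) \ge s \wedge d'_i(y|_{Q'(i)}) \le s'\big] \quad \text{(By Equation (6))}$$
$$\ge \Pr_{U'}\big[d'_j(U'|_{Q'(j)}) \ge s \wedge d'_i(U'|_{Q'(i)}) \le s'\big] - 1/c \quad \text{(By (8))}$$
$$= \Pr_{U'}\big[d'_j(U'|_{Q'(j)}) \ge s\big] \cdot \Pr_{U'}\big[d'_i(U'|_{Q'(i)}) \le s'\big] - 1/c \quad \text{(Because } Q'(i) \cap Q'(j) = \emptyset)$$
$$\ge \Big(\Pr_{y \in Y}\big[d'_j(y|_{Q'(j)}) \ge s\big] - 1/c\Big)\Big(\Pr_{y \in Y}\big[d'_i(y|_{Q'(i)}) \le s'\big] - 1/c\Big) - 1/c \quad \text{(By (8) again)}$$
$$= \Big(\Pr_{x \in X}\Big[\sum_{k \le j} x_k \ge s\Big] - 1/c\Big)\Big(\Pr_{x \in X}\Big[\sum_{k \le i} x_k \le s'\Big] - 1/c\Big) - 1/c \quad \text{(By Equation (6) again)}$$
$$\ge (1/10 - 1/c)(1/10 - 1/c) - 1/c \quad \text{(By Lemma 2.6)}$$
$$\ge 1/200 \quad \text{(For large enough c)}$$

which contradicts Lemma 2.6.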
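
The last step of the chain is plain arithmetic, and one can ask how large c must be for it alone. The Python sketch below (with the hypothetical helper `final_bound`) finds the smallest integer c for which $(1/10 - 1/c)^2 - 1/c \ge 1/200$. This only concerns that single inequality; the other parts of the proof impose their own, larger requirements on c.

```python
from fractions import Fraction

def final_bound(c):
    """The quantity (1/10 - 1/c)^2 - 1/c from the last steps of the chain."""
    t = Fraction(1, 10) - Fraction(1, c)
    return t * t - Fraction(1, c)

# Smallest positive integer c satisfying the final inequality.
smallest_c = next(c for c in range(1, 10_000) if final_bound(c) >= Fraction(1, 200))
print(smallest_c, float(final_bound(smallest_c)))
```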



3 Lemmas 

In this section we restate and prove the lemmas needed for the proof of our main theorem. 

Lemma 2.2 (Separator). (Restated.) For every n sets Q(1), Q(2), ..., Q(n) of size q each
and every desired "gap" g, there is $w \in [n/(q \cdot g)^q, n]$ and a set B of size $|B| \le w/g$ such
that there are $\ge w$ disjoint sets among

$$Q(1) \setminus B, Q(2) \setminus B, \ldots, Q(n) \setminus B.$$

Proof. Set $k_0 := n/(g \cdot q)^q$. Initialize $B := \emptyset$. Consider the following procedure with stages
i = 0, 1, ..., q. We maintain the following invariants: (1) at the beginning of stage i our
family consists of sets of size at most q − i and (2) at the beginning of any stage $i \ge 1$, $|B| \le k_0 \cdot g^{i-1} q^i$.

The i-th stage: Consider the family $(Q(1) \setminus B, Q(2) \setminus B, \ldots, Q(n) \setminus B)$. If it contains $\ge k_0 (g \cdot q)^i$ disjoint sets then we successfully terminate because by the invariant $|B| \le k_0 \cdot g^{i-1} q^i$.

If not, there must exist a covering C of size $k_0 (g \cdot q)^i (q - i)$ of the family, i.e., a set that
intersects every element in our family. To see this, greedily collect in a set S as many disjoint
sets from our family as possible. We know we will stop with $|S| < k_0 (g \cdot q)^i$. This means
that every set in our family intersects some of the sets in S. Since the sets in the family
have size at most (q − i), the set C of elements contained in any of the sets in S constitutes
a covering and has size $|C| \le k_0 (g \cdot q)^i (q - i)$.

Let $B := B \cup C$. We now finish the stage. Note that we have reduced the size of our sets
by at least 1, maintaining Invariant (1). To see that Invariant (2) is maintained, note that if i = 0
then $|B| = |C| \le k_0 \cdot q$, as desired. Otherwise, for $i \ge 1$, note that by Invariant (2) and the
bound on |C| we have

$$|B| \le |C| + k_0 \cdot g^{i-1} q^i \le k_0 (g \cdot q)^i (q - i) + k_0 \cdot g^{i-1} q^i \le k_0 \cdot g^i \cdot q^{i+1},$$

and thus Invariant (2) is maintained.

To conclude, note that the procedure terminates at stage q at most, for at stage q our
family consists of $n = k_0 (g \cdot q)^q$ empty sets which are all disjoint. □
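
The staged procedure in this proof translates directly into code. The following Python sketch is one possible rendering, under the assumptions that g is a positive integer and the input is a list of Python sets of equal size q; the function names are hypothetical. It returns the removed set B together with the indices of the sets that become pairwise disjoint.

```python
def greedy_disjoint(family):
    """Greedily pick pairwise-disjoint sets; return their indices and their union."""
    chosen, used = [], set()
    for idx, s in enumerate(family):
        if used.isdisjoint(s):  # the empty set is disjoint from everything
            chosen.append(idx)
            used |= s
    return chosen, used

def separator(sets, g):
    """Return (B, indices) such that the sets[i] - B are pairwise disjoint for i in indices."""
    n, q = len(sets), len(sets[0])
    B = set()
    for stage in range(q + 1):
        family = [s - B for s in sets]
        chosen, cover = greedy_disjoint(family)
        # Stop once we have >= k0 * (g*q)^stage disjoint sets, where k0 = n / (g*q)^q.
        # The test is written with integers to avoid floating-point error.
        if len(chosen) * (g * q) ** (q - stage) >= n:
            return B, chosen
        B |= cover  # the union of a maximal disjoint subfamily meets every non-empty set
    raise AssertionError("unreachable: at stage q every set is empty")
```

The sketch checks the greedy collection rather than a maximum disjoint subfamily, which is all the proof uses: if the greedy collection is small, its union is a small covering.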

Lemma 2.3 (Stretcher). (Restated.) Let $1 \le v_1 < v_2 < \cdots < v_w \le n$ be w indices
in [n]. Let c > 1 and n sufficiently large. Then there are $w' = 2\lfloor w/(c \cdot \lg n) \rfloor$ indices
$V' := \{v'_1, v'_2, \ldots, v'_{w'}\} \subseteq \{v_1, v_2, \ldots, v_w\}$, ordered as $v'_1 < v'_2 < \cdots < v'_{w'}$, such that

$$v'_{2k+1} - v'_{2k} \ge c\,(v'_{2k+2} - v'_{2k+1})$$

for every $k = 0, 1, \ldots, w'/2 - 1$, where $v'_0 := 0$.

Proof. Set $s := 0$, $t := \lfloor c \cdot \lg n \rfloor$ and define $v_0 := 0$. While $s \le w - t$, consider the first
$i$: $0 \le i \le t - 1$ for which

$$v_{s+i} - v_s \ge c\,(v_{s+i+1} - v_{s+i}). \tag{14}$$

Add $v_{s+i}, v_{s+i+1}$ to V'. Set s := s + i + 1 and repeat.

This gives $w' \ge 2\lfloor w/t \rfloor \ge 2\lfloor w/(c \cdot \lg n) \rfloor$ indices, as desired, assuming we can always find
$i$: $0 \le i \le t - 1$ for which (14) holds. Suppose not. We have the following contradiction:

$$v_{s+t} - v_s = v_{s+t} - v_{s+t-1} + v_{s+t-1} - v_s$$
$$> (1 + 1/c)(v_{s+t-1} - v_s) > (1 + 1/c)^2 (v_{s+t-2} - v_s) > \cdots > (1 + 1/c)^{t-1}(v_{s+1} - v_s)$$
$$\ge (1 + 1/c)^{t-1} = (1 + 1/c)^{\lfloor c \cdot \lg n \rfloor - 1} \ge (1 + 1/c)^{c \cdot \lg n}/(1 + 1/c)^2 \ge (2.25)^{\lg n}/(1 + 1/c)^2 > n,$$

for c > 1 and sufficiently large n. □
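
The selection loop in this proof can be sketched in Python as follows. The function name `stretcher` and the choice to raise an exception when no valid i exists (which the proof rules out only for sufficiently large n) are assumptions of this sketch.

```python
from math import floor, log2

def stretcher(v, n, c):
    """Select v'_1 < v'_2 < ... from the sorted indices v, following the proof's loop."""
    t = floor(c * log2(n))
    vs = [0] + list(v)          # vs[0] plays the role of v_0 := 0
    w = len(v)
    selected, s = [], 0
    while s <= w - t:
        for i in range(1, t):   # i = 0 can never satisfy (14), as the indices are distinct
            if vs[s + i] - vs[s] >= c * (vs[s + i + 1] - vs[s + i]):
                selected += [vs[s + i], vs[s + i + 1]]
                s += i + 1
                break
        else:
            raise ValueError("no valid i found; n is too small for this c")
    return selected
```

For example, with c = 2 and n = 1024 we get t = 20, and since $1.5^{19} > 1024$ the contradiction in the proof applies, so the loop always finds a valid i.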

We list next a few standard properties of entropy that we will use in the proofs.

Fact 1. Entropy satisfies the following.

1. Chain rule: For any random variables X, Y, and Z: $H(X, Y|Z) = H(X|Z) + H(Y|X, Z)$
[CT06, Equation 2.21].

2. Conditioning reduces entropy: For any random variables X, Y, Z we have $H(X|Y) \ge H(X|Y, Z)$ [CT06, Equations 2.60 and 2.92].

3. High entropy implies uniform: Let X be a random variable taking values in a set S
and suppose that $H(X) \ge \lg|S| - a$; then X is $4\sqrt{a}$-close to uniform [CK82, Chapter
3; Exercise 17].
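
Items 1 and 2 of Fact 1 can be checked numerically on a small example. The Python sketch below computes conditional entropy straight from the definition $H(X|Y) = E_{y \sim Y} H(X|Y = y)$; the joint distribution (two uniform bits X, Y and Z = X AND Y) and all function names are illustrative assumptions.

```python
import math
from collections import defaultdict

def entropy(dist):
    """H of a distribution given as {outcome: probability}."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginal distribution of the coordinates listed in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def cond_entropy(joint, target, given):
    """H(target | given), as an average of H(target | given = g) over g."""
    total = 0.0
    for g, pg in marginal(joint, given).items():
        cond = defaultdict(float)
        for outcome, p in joint.items():
            if tuple(outcome[i] for i in given) == g:
                cond[tuple(outcome[i] for i in target)] += p / pg
        total += pg * entropy(cond)
    return total

# Coordinates: 0 = X, 1 = Y, 2 = Z, with X, Y uniform bits and Z = X AND Y.
joint = {(x, y, x & y): 0.25 for x in (0, 1) for y in (0, 1)}
chain_lhs = cond_entropy(joint, (0, 1), (2,))
chain_rhs = cond_entropy(joint, (0,), (2,)) + cond_entropy(joint, (1,), (0, 2))
print(chain_lhs, chain_rhs)
```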

Lemma 2.5 (Information-theoretic II). (Restated.) Let Z be uniformly distributed in a
set $X \subseteq \{0,1\}^n$ of size $|X| = 2^{n-a}$. Let $Z = (Z_1, \ldots, Z_k)$ where $Z_i \in \{0,1\}^{s_i}$ (so that
$\sum_{i \le k} s_i = n$). There is a set $G \subseteq [k]$ of size $|G| \ge k - a/\epsilon$ such that for any $i \in G$ we have

$$H(Z_i \mid Z_1 Z_2 \cdots Z_{i-1}) \ge s_i - \epsilon.$$

In particular, $Z_i$ is close to uniform over $\{0,1\}^{s_i}$.

Proof. We have $H(Z) = \lg|X| = n - a = \sum_i s_i - a$. By the chain rule for entropy,

$$\sum_{i \le k} \big(s_i - H(Z_i \mid Z_1 Z_2 \cdots Z_{i-1})\big) = a.$$

Applying Markov inequality to the non-negative random variable $s_i - H(Z_i \mid Z_1 Z_2 \cdots Z_{i-1})$
(for random $i \in [k]$), we have

$$\Pr_{i \in [k]}\big[s_i - H(Z_i \mid Z_1 Z_2 \cdots Z_{i-1}) \ge \epsilon\big] \le a/(k \cdot \epsilon),$$

yielding the desired G.

The "in particular" part is an application of Items (2) and (3) in Fact 1. □
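
As a small worked instance of this counting argument, the Python sketch below takes X to be the even-parity strings of length 6 (so a = 1) split into three blocks of 2 bits, and computes each entropy deficiency $s_i - H(Z_i \mid Z_1 \cdots Z_{i-1})$. The example set and the name `block_deficiencies` are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product
from math import log2

def block_deficiencies(X, sizes):
    """s_i - H(Z_i | Z_1 ... Z_{i-1}) per block, for Z uniform over the set X."""
    def prefix_entropy(length):
        counts = Counter(x[:length] for x in X)
        return sum((m / len(X)) * log2(len(X) / m) for m in counts.values())
    out, pos = [], 0
    for s in sizes:
        # Chain rule: H(Z_i | Z_1..Z_{i-1}) = H(Z_1..Z_i) - H(Z_1..Z_{i-1}).
        cond = prefix_entropy(pos + s) - prefix_entropy(pos)
        out.append(s - cond)
        pos += s
    return out

n = 6
X = [x for x in product((0, 1), repeat=n) if sum(x) % 2 == 0]  # |X| = 2^(n-1), so a = 1
deficiencies = block_deficiencies(X, [2, 2, 2])
print(deficiencies)
```

Here the whole deficiency a = 1 lands on the last block, so for any $\epsilon \le 1$ exactly one block is excluded from G, within the $a/\epsilon$ allowance.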



Lemma 2.6 (Entropy-sum). (Restated.) Let $X_1, X_2, \ldots, X_n$ be 0-1 random variables, and
p < i < j three indices in [n] such that for $\ell := (i - p)$ and $d := j - i$ we have $\ell \ge c \cdot d$ for a
sufficiently large c. Suppose that

$$H(X_{p+1}, X_{p+2}, \ldots, X_j \mid X_1, X_2, \ldots, X_p) \ge \ell + d - 1/c.$$

Then there exists an integer t such that

$$\Pr_X\Big[\sum_{k \le j} X_k \ge t + \ell/2 + d/2 + c^{1/4}\sqrt{d}\Big] \ge 1/10, \text{ and}$$
$$\Pr_X\Big[\sum_{k \le i} X_k \le t + \ell/2\Big] \ge 1/10, \text{ but}$$
$$\Pr_X\Big[\sum_{k \le j} X_k \ge t + \ell/2 + d/2 + c^{1/4}\sqrt{d} \wedge \sum_{k \le i} X_k \le t + \ell/2\Big] \le 1/1000 \ (< 1/10 \cdot 1/10).$$

Proof. Let us start with the last inequality, because we can prove it without getting our
hands on t. First, by Fact 1, in particular the distribution of $X_{i+1}, \ldots, X_j$
is $4/\sqrt{c}$ close to the uniform $U_1 U_2 \cdots U_d$. We have

$$\Pr_X\Big[\sum_{k \le j} X_k \ge t + \ell/2 + d/2 + c^{1/4}\sqrt{d} \wedge \sum_{k \le i} X_k \le t + \ell/2\Big]$$
$$\le \Pr_X\Big[\sum_{k=i+1}^{j} X_k \ge d/2 + c^{1/4}\sqrt{d}\Big]$$
$$\le \Pr_U\Big[\sum_{k=1}^{d} U_k \ge d/2 + c^{1/4}\sqrt{d}\Big] + 4/\sqrt{c}$$
$$\le 1/2000 + 4/\sqrt{c} \le 1/1000,$$

where the second to last inequality follows from Chebyshev's inequality for sufficiently large
c.
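
To see what "sufficiently large c" means in this step, note that a sum of d fair bits has variance d/4, so Chebyshev's inequality bounds the tail at deviation $c^{1/4}\sqrt{d}$ by $(d/4)/(\sqrt{c}\, d) = 1/(4\sqrt{c})$. The Python sketch below, parametrized by the integer $\sqrt{c}$ for exactness, evaluates this bound and compares it with an exact tail; the function names and sample values are assumptions.

```python
from fractions import Fraction
from math import comb, sqrt

def chebyshev_bound(sqrt_c):
    """Chebyshev bound 1/(4*sqrt(c)) on Pr[sum of d fair bits - d/2 >= c^{1/4} sqrt(d)]."""
    return Fraction(1, 4 * sqrt_c)

def exact_tail(d, dev):
    """Exact Pr[S - d/2 >= dev] for S a sum of d fair 0-1 coin tosses."""
    hits = sum(comb(d, k) for k in range(d + 1) if 2 * k - d >= 2 * dev)
    return Fraction(hits, 2 ** d)

sqrt_c, d = 4, 100                      # illustrative: c = 16
dev = sqrt(sqrt_c * d)                  # c^{1/4} * sqrt(d) = 20
print(float(exact_tail(d, dev)), float(chebyshev_bound(sqrt_c)))

# Both terms of "1/2000 + 4/sqrt(c)" are at most 1/2000 once sqrt(c) >= 8000.
print(chebyshev_bound(8000) <= Fraction(1, 2000), Fraction(4, 8000) <= Fraction(1, 2000))
```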

We now verify the first two inequalities in the conclusion of the lemma. Let $Y := X_1, X_2, \ldots, X_p$ stand for the prefix, and $Z := X_{p+1}, X_{p+2}, \ldots, X_j$ for the $\ell + d$ high-entropy
variables. Let

$$A := \{y \in \{0,1\}^p : H(Z|Y = y) \ge \ell + d - 2/c\}$$

be the set of prefix values conditioned on which Z has high entropy. We claim that $\Pr[Y \in A] \ge 1/2$. This is because, applying Markov Inequality to the non-negative random variable
$\ell + d - H(Z|Y = y)$ (for y chosen according to Y),

$$\Pr[Y \notin A] = \Pr_{y \sim Y}[\ell + d - H(Z|Y = y) > 2/c] \le E_{y \sim Y}[\ell + d - H(Z|Y = y)]/(2/c) = (\ell + d - H(Z|Y))/(2/c) \le (1/c)/(2/c) = 1/2.$$

Note that for every $y \in A$ we have, by definition, that the $(\ell + d)$-bit random variable
(Z|Y = y) has entropy at least $\ell + d - 2/c$, and so by Item 3 in Fact 1 the random variable
(Z|Y = y) is ($\epsilon := 4\sqrt{2/c}$)-close to uniform over $\{0,1\}^{\ell+d}$. Therefore, for any subset $S \subseteq A$,
the random variable

$$(Z \mid Y \in S) \text{ is } \epsilon\text{-close to uniform over } \{0,1\}^{\ell+d}. \tag{15}$$

Now define t to be the largest integer such that

$$\Pr\Big[Y \in A \wedge \sum_{k \le p} Y_k \ge t\Big] \ge 1/4. \tag{16}$$

Since by definition of t we have $\Pr[Y \in A \wedge \sum_{k \le p} Y_k \ge t + 1] < 1/4$, we also have

$$\Pr\Big[Y \in A \wedge \sum_{k \le p} Y_k \le t\Big] \ge 1/2 - 1/4 = 1/4. \tag{17}$$

We obtain the desired conclusions as follows, denoting by $U_1, U_2, \ldots$ uniform and independent
0-1 random variables. First,

$$\Pr\Big[\sum_{k \le j} X_k \ge t + (\ell + d)/2 + c^{1/4}\sqrt{d}\Big]$$
$$\ge \Pr\Big[\sum_{k \le j} X_k \ge t + (\ell + d)/2 + \sqrt{\ell}/c^{1/4}\Big] \quad \text{(Because } \ell \ge c \cdot d)$$
$$= \Pr\Big[\sum_{k \le p} Y_k + \sum_{k \le \ell+d} Z_k \ge t + (\ell + d)/2 + \sqrt{\ell}/c^{1/4}\Big]$$
$$\ge \Pr\Big[\sum_{k \le \ell+d} Z_k \ge (\ell + d)/2 + \sqrt{\ell}/c^{1/4} \;\Big|\; Y \in A \wedge \sum_{k \le p} Y_k \ge t\Big] \cdot \Pr\Big[Y \in A \wedge \sum_{k \le p} Y_k \ge t\Big]$$
$$\ge \Big(\Pr\Big[\sum_{k \le \ell+d} U_k \ge (\ell + d)/2 + \sqrt{\ell}/c^{1/4}\Big] - \epsilon\Big) \cdot (1/4)$$
$$\ge \big(1/2 - (\sqrt{\ell}/c^{1/4}) \cdot \Theta(1/\sqrt{\ell}) - \epsilon\big)(1/4)$$
$$\ge \big(1/2 - \Theta(1/c^{1/4}) - \epsilon\big)(1/4) \ge 1/10,$$

where the third inequality uses (15) and (16), and the fourth uses the standard estimate
$\Pr[\sum_{k \le \ell+d} U_k = (\ell + d)/2 + b] \le \Pr[\sum_{k \le \ell+d} U_k = \lfloor(\ell + d)/2\rfloor] = \Theta(1/\sqrt{\ell + d}) \le \Theta(1/\sqrt{\ell})$ (cf. [CT06, Lemma 17.5.1]).

Second,

$$\Pr\Big[\sum_{k \le i} X_k \le t + \ell/2\Big] = \Pr\Big[\sum_{k \le p} Y_k + \sum_{k \le \ell} Z_k \le t + \ell/2\Big]$$
$$\ge \Pr\Big[\sum_{k \le \ell} Z_k \le \ell/2 \;\Big|\; Y \in A \wedge \sum_{k \le p} Y_k \le t\Big] \cdot \Pr\Big[Y \in A \wedge \sum_{k \le p} Y_k \le t\Big]$$
$$\ge \Big(\Pr\Big[\sum_{k \le \ell} U_k \le \ell/2\Big] - \epsilon\Big) \cdot (1/4) \ge (1/2 - \epsilon) \cdot (1/4) \ge 1/10$$

for all sufficiently large c. Here the second inequality uses (15) and (17). □



4 Balanced brackets 

In this section we prove our lower bound for balanced brackets. We start by formally defining
the problem and then we restate our theorem.

Definition 4.1 (Data structure for balanced brackets). We say that we store $\mathrm{Bal} := \{x \in \{0,1\}^n : x$ corresponds to a string of balanced brackets$\}$ in $[n]^u$ supporting match queries by
probing q cells if there is a map $\mathrm{Enc} : \{0,1\}^n \to [n]^u$, n sets $Q(1), \ldots, Q(n) \subseteq [u]$ of size q
each and n decoding functions $d_1, \ldots, d_n$ mapping $[n]^q$ to [n] such that for every $x \in \mathrm{Bal}$
and every $i \in [n]$:

$$\mathrm{Match}(i) := \text{index to bracket matching } i = d_i\big(\mathrm{Enc}(x)|_{Q(i)}\big),$$

where $\mathrm{Enc}(x)|_{Q(i)}$ denotes the q cells of size n of $\mathrm{Enc}(x) \in [n]^u$ indexed by Q(i).
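
For concreteness, the Match query itself (computed directly from x, without any succinct data structure) can be sketched in Python with a stack. The encoding convention, bit 1 for an open bracket and bit 0 for a closed one, and the name `match_table` are assumptions of this sketch.

```python
def match_table(x):
    """Map each position i (1-indexed) of a balanced string x to Match(i).
    Assumed convention: bit 1 is an open bracket, bit 0 a closed bracket."""
    stack, match = [], {}
    for pos, bit in enumerate(x, start=1):
        if bit == 1:
            stack.append(pos)
        else:
            if not stack:
                raise ValueError("not balanced: unmatched closed bracket")
            opener = stack.pop()
            match[opener], match[pos] = pos, opener
    if stack:
        raise ValueError("not balanced: unmatched open bracket")
    return match

m = match_table([1, 1, 0, 1, 0, 0])   # the string "(()())"
print(m)
```

A structural property visible here is that for i < j the events Match(i) > j and Match(j) < i never occur together, since bracket pairs cannot cross.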

Theorem 1.2 (Lower bound for balanced brackets). (Restated.) To store $\mathrm{Bal} := \{x \in \{0,1\}^n : x$ corresponds to a string of balanced brackets$\}$ in $[n]^u$, n even, so that each Match(i)
query can be computed by non-adaptively probing q cells of $\lg_2 n$ bits, one needs memory

$$u \cdot \lg_2 n \ge n - 1 + n/\lg^{A^q} n,$$

where A is an absolute constant.



Overview of the proof: In the same spirit of the proof of Theorem 1.1, we show that a
too efficient data structure allows us to break the dependencies between queries and gives
the following contradiction, for some subset of inputs X and indices i < j:

$$0 = \Pr_{x \in X}[\mathrm{Match}(i) > j \wedge \mathrm{Match}(j) < i]$$
$$\approx \Pr_{x \in X}[\mathrm{Match}(i) > j] \cdot \Pr_{x \in X}[\mathrm{Match}(j) < i] \ge \epsilon \cdot \epsilon.$$

Above, the first equality obviously holds because i < j. The next breaking of the dependencies
is again obtained with the separator lemma plus the information-theoretic lemma.
For the final inequalities we find two indices i < j that are close to each other, and also
make sure that the input bits between i and j have high entropy, and then we use standard
estimates that bound these probabilities by $\epsilon := \Omega(1/\sqrt{j - i})$. Whereas the corresponding
probabilities in the proof of Theorem 1.1 can be bounded from below by a constant, here
the bound deteriorates with the distance of the indices. This forces us to use the separator
lemma with different parameters and overall yields a weaker bound.

We need the following version of the separator lemma. 

Lemma 4.2 (Separator). Let c> A be any fixed, given constant. For all sufficiently large n, 
for alln sets Q{1),Q{2), . . . ,Q{n) of size q < (lglg?T,)/c each, there are two integers a,b>l 
such that 

c-a<b<c - {2cy, 

and n/\^n > 1, and there is a set B of size \B\ < n/\g^n such that there are > n/\g°'n 
disjoint sets among 

Qil)\B,Qi2)\B,...,Qin)\B. 
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Proof. Let d :— 2c, L :— \gn. Initialize B :— 0. Consider the following procedure with 
stages i = 0,1, . . . ,q. Wc maintain two invariants: (1) at the beginning of stage i the family 
Q(l) \ B, Q{2) \ B, . . . , Q{n) \ B consists of sets of size q — i and (2) at the beginning of any 
stage i > 1, 

\B\ < nlL--^"-', 

while at the beginning of stage i = we have \B\ = 0. 

The i-th stage: Consider the family {Q{1) \ B, Q{2) \B,. . . , Q{n) \ B). If it contains 

disjoint sets then we successfully terminate the procedure, because by the invariant \B\ < 

If not, there must exist a covering C of size {q — i) ■ n/L'^'' ' of the family, i.e., a set that 
intersects every element in our family. To see this, greedily collect in a set S as many disjoint 
sets from our family as possible. We know we will stop with \S\ < njL^'^ \ This means that 
every set in our family intersects some of the sets in S. Since the sets in the family have size 
at most (q — i), the set C of elements contained in any of the sets in S constitutes a covering 
and has size \C\ < {q — i) ■ njL'^'^ \ 

Let B := B[jC. We now finish the stage. Note that we have reduced the size of our 
sets by 1, maintaining Invariant (1). To see that Invariant (2) is maintained, note that by 
Invariant (2) and the bound on |C| we have 

\B\ + |C| < i • n/L''-''''' + {q-i)- n/L^"'' = q ■ n/L''''' < nlU-'^'~'~\ 

where the last inequality holds because it is equivalent to 

q . L'-^'-"' < L""-' 
<^= Ig g + c ■ Ig Ig n < (P'"- Ig Ig n 

^(lgg)/lglgn < d'^-' - c ■ d"-'-^ = d'^-' - d'^-'/2 = ^^-'12 

and d^~Y2 > 1/2 and since q < (lglgn)/c we have (lgg)/lglgn < 1/2 for large enough n. 

Note that the procedure successfully terminates at stage $i = q$ at the latest, for at stage $i = q$ our family consists of

$$n \ge n/L^{d^{q-q}} = n/L$$

empty sets which are all disjoint.

To conclude, it only remains to verify the bound $n/L^b \ge 1 \iff \lg n \ge b \lg\lg n$. Observe that $b = c \cdot d^{q-i}$ for some $i = 0, 1, \ldots, q$. Therefore

$$b \le c \cdot d^q \le c (2c)^{(\lg\lg n)/c} \le c (\lg n)^{(\lg 2c)/c} \le c (\lg n)^{3/4} \implies b \lg\lg n \le c (\lg n)^{3/4} \lg\lg n \le \lg n,$$

where we use that $c \ge 4$ is fixed, and that $n$ is sufficiently large. □
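The staged procedure in the proof above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `greedy_disjoint` and `separate` are hypothetical, and the per-stage thresholds (which the proof sets to $n/L^{d^{q-i}}$) are passed in as a plain list so that the sketch runs on toy inputs.

```python
def greedy_disjoint(family):
    """Greedily collect pairwise-disjoint sets from `family`.

    Returns the indices of the chosen sets and the union of their elements.
    Every set that is not chosen intersects the union, so the union plays
    the role of the covering C in the proof.
    """
    chosen, union = [], set()
    for index, members in enumerate(family):
        if union.isdisjoint(members):
            chosen.append(index)
            union |= members
    return chosen, union


def separate(probe_sets, thresholds):
    """Run the staged procedure; thresholds[i] is the target count at stage i.

    The thresholds are an assumption of this sketch, supplied by the caller.
    """
    B = set()
    for need in thresholds:
        family = [set(Q) - B for Q in probe_sets]
        chosen, union = greedy_disjoint(family)
        if len(chosen) >= need:
            return B, chosen  # enough disjoint residual sets: stop
        B |= union  # add the covering of this stage to B
    raise ValueError("no stage reached its threshold")
```

If `thresholds` has $q + 1$ entries and its last entry is at most $n$, the loop always returns: each unsuccessful stage removes at least one element from every nonempty residual set, so after $q$ stages all residual sets are empty and hence disjoint.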
We then recall a few standard facts about balanced brackets. 
Lemma 4.3. The number of strings of length $n$ that correspond to balanced brackets is $\binom{n}{n/2}/(n/2 + 1)$, if $n$ is even.

Let $x = (x_1, x_2, \ldots, x_d)$ be uniformly distributed in $\{0,1\}^d$. Then

$$\Pr[x_1 \text{ is an open bracket and is not matched by } x_2, x_3, \ldots, x_d]$$
$$= \Pr[x_d \text{ is a closed bracket and is not matched by } x_1, \ldots, x_{d-2}, x_{d-1}] \ge \alpha/\sqrt{d},$$

for a universal constant $\alpha > 0$.

Remarks on the proof. The first equality is the well-known expression for Catalan numbers. We now consider the second claim in the statement of the lemma. This probability is easily seen to be at least 1/2 times the probability that a $+1, -1$ walk of length $d - 1$ starting at 0 never falls below 0. Assuming without loss of generality that $d - 1$ is even, the latter probability equals the probability that a $+1, -1$ walk of length $d - 1$ starting at 0 ends at 0 (see, e.g., (2) in www.math.harvard.edu/~lauren/154/Outlinel4.pdf). Standard estimates (cf. [CT06, Lemma 17.5.1]) show this is $\ge \Omega(1/\sqrt{d})$. □
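Both claims of Lemma 4.3 can be checked by brute force for small parameters. The sketch below is only a sanity check, not part of the proof; the encoding (1 for an open bracket, 0 for a closed one) and the function names are assumptions made for illustration.

```python
from itertools import product
from math import comb


def is_balanced(bits):
    """True if bits (1 = open, 0 = close) form a balanced bracket string."""
    depth = 0
    for bit in bits:
        depth += 1 if bit == 1 else -1
        if depth < 0:
            return False
    return depth == 0


def count_balanced(n):
    """Count balanced strings of length n by exhaustive enumeration."""
    return sum(1 for bits in product((0, 1), repeat=n) if is_balanced(bits))


def count_first_open_unmatched(d):
    """Count x in {0,1}^d whose first bracket is open and never closed in x."""
    total = 0
    for bits in product((0, 1), repeat=d):
        depth, unmatched = 0, True
        for bit in bits:
            depth += 1 if bit == 1 else -1
            if depth <= 0:  # first bracket is closed, or x_1 got matched
                unmatched = False
                break
        total += unmatched
    return total


for d in (3, 5, 7, 9, 11):
    prob = count_first_open_unmatched(d) / 2 ** d
    # prob * sqrt(d) should stay bounded away from 0 as d grows
    print(d, prob, prob * d ** 0.5)
```

For odd $d$ the count returned by `count_first_open_unmatched(d)` agrees with $\binom{d-1}{(d-1)/2}$, which is the walk identity quoted in the remarks: half of the strings start with an open bracket, and the remaining $d - 1$ steps must never fall below the starting level.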

4.1 Proof of Theorem 1.2

Let $c$ be a fixed, sufficiently large constant to be determined later, and let $n$ go to infinity. We prove the theorem for $A := 2^c$. Specifically, we assume for the sake of contradiction that there exists a representation with redundancy $n/\lg^{A^q} n - 1$ and we derive a contradiction. We clearly must have $q \ge 1$. Also, note that we can assume that $q \le (\lg\lg n)/c$, for else the redundancy is $-1 + n/\lg^{A^q} n \le -1 + n/\lg^{\lg n} n < 0$, which is impossible. Then, since $q \le (\lg\lg n)/c$, we can apply Lemma 4.2 to the sets $Q(1), \ldots, Q(n)$ to obtain integers $a, b \ge 1$ such that

$$c \cdot a \le b \le c \cdot (2c)^q, \qquad (18)$$

$n/\lg^b n \ge 1$ and a set $B \subseteq [u]$ of size $|B| := n/\lg^b n$ such that there are at least $n/\lg^a n$ disjoint sets among

$$Q(1) \setminus B,\ Q(2) \setminus B,\ \ldots,\ Q(n) \setminus B.$$

Let these sets be

$$Q(v_1) \setminus B,\ Q(v_2) \setminus B,\ \ldots,\ Q(v_{n/\lg^a n}) \setminus B,$$

where we order $1 \le v_1 < v_2 < \cdots < v_{n/\lg^a n} \le n$ and let $V := \{v_1, v_2, \ldots, v_{n/\lg^a n}\}$ be the corresponding set of indices. Also define the parameter

$$d := 16 \lg^a n.$$

Over the choice of a uniform input $x \in \mathrm{Bal} \subseteq \{0,1\}^n$, consider the most likely value $z$ for the $n/\lg^b n$ cells indexed by $B$. Let us fix this value for the cells. Since this is the most likely value, we are still decoding correctly a set $X$ of at least $|\mathrm{Bal}|/n^{|B|}$ inputs. From now on we focus on this set of inputs. Since these values are fixed, we can modify our decoding as follows. For every $i$ define $Q'(i) := Q(i) \setminus B$ and also let $d'_i$ be $d_i$ where the values of the probes corresponding to cells in $B$ have been fixed to the corresponding value in $z$. By renaming variables, letting $u' := u - |B|$ and $\mathrm{Enc}' : \{0,1\}^n \to [n]^{u'}$ be $\mathrm{Enc}$ restricted to the cells in $[u] \setminus B$, we see that we are now encoding $X$ in $[n]^{u'}$ in the following sense: for every $x \in X$ and every $i \in [n]$:

$$\mathrm{Match}(i) = d'_i(\mathrm{Enc}'(x)|_{Q'(i)}), \qquad (19)$$

where note for any distinct $i, j \in V$ we have $Q'(i) \cap Q'(j) = \emptyset$.

Uniform cells: To the choice of a uniform $x \in X \subseteq \{0,1\}^n$ there corresponds a uniform encoding $y \in Y \subseteq [n]^{u'}$, where

$$|X| = |Y| \ge |\mathrm{Bal}|/n^{|B|} = |\mathrm{Bal}|/2^{n/\lg^{b-1} n}.$$

Let $y = (y_1, \ldots, y_{u'})$ be selected uniformly in $Y \subseteq [n]^{u'}$. By Lemma 2.4 with $\eta := 1/(c \cdot d) = 1/(c \cdot 16 \lg^a n)$ there is a set $G \subseteq [u']$ of size

$$|G| \ge u' - 16 \cdot 2q \cdot \lg(n^{u'}/|Y|) \cdot c^2 d^2 \ge u' - 16 \cdot 2q \cdot \lg(n^u/|\mathrm{Bal}|) \cdot c^2 d^2 = u' - 32 q \cdot r \cdot c^2 d^2,$$

where $r := (u \lg n) - \lg |\mathrm{Bal}| \le n/\lg^{A^q} n - 1$ is the redundancy of the data structure, such that for any $2q$ indices $k_1, k_2, \ldots, k_{2q} \in G$, the cells $y_{k_1}, \ldots, y_{k_{2q}}$ are jointly $(1/(c \cdot d))$-close to uniform. Since the sets $Q'(v_1), Q'(v_2), \ldots \subseteq [u']$ are disjoint, each cell outside $G$ lies in at most one of them, so there is a set $V_2 \subseteq V$ such that

for any $i, j \in V_2$ and $y$ uniform in $Y$ the distribution $(y|_{Q'(i)}, y|_{Q'(j)})_{y \in Y}$ is $1/(c \cdot d) = 1/(c \cdot 16 \lg^a n)$ close to uniform over $[n]^{|Q'(i)| + |Q'(j)|}$, (20)

and the size of $V_2$ is

$$|V_2| \ge |V| - 32 q \cdot r \cdot c^2 \cdot d^2 \ge \frac{n}{\lg^a n} - 32 q \cdot \frac{n}{\lg^{A^q} n} \cdot c^2 \cdot d^2 \ge \frac{n}{2 \lg^a n}, \qquad (21)$$

where the last inequality in (21) holds because, recalling $d = 16 \lg^a n$, it is implied by $\lg^{A^q - 3a} n \ge (16)^2 \cdot 64 q \cdot c^2$, which is true because of the bounds on $q, c, n$, using that $A := 2^c$, $a \le (2c)^q$.

Close: Order the indices in $V_2$ as $v'_1 < v'_2 < \cdots < v'_{n/(2 \lg^a n)}$. Consider the consecutive $\ge \lfloor |V_2|/2 \rfloor$ pairs $\{v'_1, v'_2\}, \{v'_3, v'_4\}, \ldots$ Throw away all those such that the distance of the corresponding indices is $\ge d$. Since $V_2 \subseteq [n]$, we throw away at most $n/d$ pairs. Put the indices of the remaining pairs in $V_3$. So $V_3$ contains at least

$$|V_3|/2 \ge \lfloor |V_2|/2 \rfloor - n/d \ge n/(8 \lg^a n) - n/(16 \lg^a n) \ge n/(16 \lg^a n) \qquad (22)$$

of these pairs (and twice as many indices).
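The pairing step above is simple enough to state as code. The following is a minimal sketch, assuming the indices are given as plain integers; the function name `close_pairs` is hypothetical and not taken from the paper.

```python
def close_pairs(indices, d):
    """Pair up consecutive sorted indices and keep pairs at distance < d.

    The pairs (v'_1, v'_2), (v'_3, v'_4), ... span disjoint intervals of [n],
    so at most n/d of them can have distance >= d.
    """
    ordered = sorted(indices)
    pairs = [(ordered[k], ordered[k + 1]) for k in range(0, len(ordered) - 1, 2)]
    return [(i, j) for i, j in pairs if j - i < d]


# Toy usage: 90 is left unpaired, and the pair (10, 40) is too far apart.
print(close_pairs([1, 2, 10, 40, 41, 45, 90], 5))
```

The counting bound in (22) relies only on the fact that the discarded pairs occupy disjoint stretches of $[n]$, each of length at least $d$.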

Entropy in input bits: For a uniform $x \in X$, let

$$x = Z_1 Z_2 \cdots Z_{|V_3|/2}$$

where the variables $Z_k$ are a partition of $x$ in consecutive bits such that each contains exactly one pair from $V_3$. We now apply the information-theoretic Lemma 2.5 with $4\sqrt{\epsilon} := 1/(c\sqrt{d})$ (i.e., $\epsilon = 1/(16 c^2 \cdot d)$) and using the bound on $|X|$

$$|X| \ge |\mathrm{Bal}|/n^{|B|} \ge \frac{\binom{n}{n/2}}{(n/2 + 1)\, 2^{n/\lg^{b-1} n}} \ge \frac{2^n}{\sqrt{2n}\,(n/2 + 1)\, 2^{n/\lg^{b-1} n}} \ge \frac{2^n}{n^2 \cdot 2^{n/\lg^{b-1} n}},$$

where we use Lemma 4.3 and that $n$ is sufficiently large, which implies

$$\lg(2^n/|X|) \le 2 \lg n + n/\lg^{b-1} n \le n/\lg^{b-2} n$$

since $n/\lg^b n \ge 1$. The lemma guarantees that all but

$$16 \cdot \frac{n}{\lg^{b-2} n} \cdot c^2 \cdot d = (16)^2 \cdot \frac{n}{\lg^{b-2-a} n} \cdot c^2 \qquad (23)$$

variables will be $1/(c\sqrt{d})$ close to uniform. Since by Equation (22) we have at least $|V_3|/2 \ge n/(16 \lg^a n)$ variables $Z_k$, and $b \ge c \cdot a$, there exists a variable that is $1/(c\sqrt{d})$ close to uniform. This variable contains one pair from $V_3$. Let $i < j \in [n]$ be the corresponding indexes, which recall satisfy $j - i < d$. Let $U'$ denote the uniform distribution on the $u'$ cells. We have the following contradiction:



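$$\begin{aligned}
0 &= \Pr_{x \in X}\left[\mathrm{Match}(i) > j \wedge \mathrm{Match}(j) < i\right] && \text{(Because } i < j\text{)} \\
&= \Pr_{y \in Y}\left[d'_i(y|_{Q'(i)}) > j \wedge d'_j(y|_{Q'(j)}) < i\right] && \text{(By Equation (19))} \\
&\ge \Pr_{U'}\left[d'_i(U'|_{Q'(i)}) > j \wedge d'_j(U'|_{Q'(j)}) < i\right] - 1/(c \cdot d) && \text{(By (20))} \\
&= \Pr_{U'}\left[d'_i(U'|_{Q'(i)}) > j\right] \cdot \Pr_{U'}\left[d'_j(U'|_{Q'(j)}) < i\right] - 1/(c \cdot d) && \text{(Because } Q'(i) \cap Q'(j) = \emptyset\text{)} \\
&\ge \left(\Pr_{y \in Y}\left[d'_i(y|_{Q'(i)}) > j\right] - 1/(c \cdot d)\right)\left(\Pr_{y \in Y}\left[d'_j(y|_{Q'(j)}) < i\right] - 1/(c \cdot d)\right) - 1/(c \cdot d) && \text{(By (20) again)} \\
&= \left(\Pr_{x \in X}\left[\mathrm{Match}(i) > j\right] - 1/(c \cdot d)\right)\left(\Pr_{x \in X}\left[\mathrm{Match}(j) < i\right] - 1/(c \cdot d)\right) - 1/(c \cdot d) && \text{(By Equation (19) again)} \\
&\ge \left(\Pr_{x \in \{0,1\}^n}\left[\mathrm{Match}(i) > j\right] - 2/(c \cdot \sqrt{d})\right)\left(\Pr_{x \in \{0,1\}^n}\left[\mathrm{Match}(j) < i\right] - 2/(c \cdot \sqrt{d})\right) - 1/(c \cdot d) && \text{(because } Z_k \text{ is } 1/(c\sqrt{d}) \text{ close to uniform)} \\
&\ge \left(\Omega(1/\sqrt{d}) - 2/(c \cdot \sqrt{d})\right)\left(\Omega(1/\sqrt{d}) - 2/(c \cdot \sqrt{d})\right) - 1/(c \cdot d) && \text{(By Lemma 4.3)} \\
&> 0. && \text{(For large enough } c\text{)}
\end{aligned}$$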
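The first line of the chain rests on a purely combinatorial fact: in a balanced string, for $i < j$ the events $\mathrm{Match}(i) > j$ and $\mathrm{Match}(j) < i$ cannot occur together, since matched pairs never cross. The sketch below confirms this exhaustively for short strings. It assumes the encoding 1 for an open bracket and 0 for a closed one, uses 0-based positions, and the function names are hypothetical.

```python
from itertools import product


def match_table(bits):
    """Return the list of matching positions, or None if bits is unbalanced."""
    stack, match = [], [None] * len(bits)
    for pos, bit in enumerate(bits):
        if bit == 1:
            stack.append(pos)
        elif stack:
            opener = stack.pop()
            match[opener], match[pos] = pos, opener
        else:
            return None  # a closed bracket with nothing to match
    return match if not stack else None


def crossing_exists(n):
    """True if some balanced string of length n has i < j with
    Match(i) > j and Match(j) < i."""
    for bits in product((0, 1), repeat=n):
        match = match_table(bits)
        if match is None:
            continue
        for i in range(n):
            for j in range(i + 1, n):
                if match[i] > j and match[j] < i:
                    return True
    return False


print([crossing_exists(n) for n in (2, 4, 6, 8)])
```

The proof then shows that a too efficient data structure would make the two events nearly independent with each having probability $\Omega(1/\sqrt{d})$, which is incompatible with their joint probability being zero.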
5 Open problems



One open problem is to handle adaptive probes. Another is to prove lower bounds for the 
membership problem: to our knowledge nothing is known even for two non-adaptive cell 
probes when the set size is a constant fraction of the universe. The difficulty in extending 
the results in this paper to the membership problem is that the correlations between query 
answers are less noticeable. 

Acknowledgments. We thank Mihai Pătraşcu for a discussion on the status of data
structures for prefix sums. 